Abstract. (Jaynes') method of (Shannon-Kullback's) Relative Entropy Maximization (REM or MaxEnt) can,
at least in the discrete case, be viewed as an asymptotic instance of the Maximum Probability method
(MaxProb), according to the Maximum Probability Theorem (MPT). A simple Bayesian interpretation of
MaxProb is given here. MPT carries the interpretation over into REM.



INTRODUCTION 

The relationship between the method of (Shannon-Kullback's)
Relative Entropy Maximization (REM or MaxEnt)
and the Bayesian method is notoriously peculiar. The
two methods of induction are viewed as entirely unrelated,
or opposed, or identical in some circumstances,
or one as a special case of the other (see [8]).

As has been noted, the finding that REM can be
viewed as an asymptotic instance of the Maximum
Probability method (MaxProb, cf. [3]) implies that
MaxProb/REM/MaxEnt cannot be in conflict with
Bayes' Theorem (cf. [4]).

A beautiful, simple (yet to some extent overlooked)
Bayesian interpretation of REM, which operates
on the level of samples and employs the Conditioned
Weak Law of Large Numbers (CWLLN), was suggested
and elaborated in [2]. Csiszar's original argument,
together with the Maximum Probability Theorem
(MPT, see [3], Thm 1), inspired a Bayesian interpretation
of the MaxProb and REM methods, which
we present here.

TERMINOLOGY AND NOTATION 

Let $X = \{x_1, x_2, \dots, x_m\}$ be a discrete finite set, called the
support, with $m$ elements, and let $\{X_l, l = 1, 2, \dots, n\}$
be a sequence of $n$ independent and identically
distributed random variables taking values in $X$.

A type $\nu^n = [n_1, n_2, \dots, n_m]/n$ is an empirical
probability mass function which can be based on the
sequence $\{X_l, l = 1, 2, \dots, n\}$. Thus, $n_i$ denotes the number
of occurrences of the i-th element of $X$ in the sequence.

Let $\mathcal{P}(X)$ be the set of all probability mass functions
(pmf's) on $X$. Let $\Pi_n \subseteq \mathcal{P}(X)$ be the set of all types
$\nu^n$, and let $\mathcal{H}_n \subseteq \Pi_n$.

Let the supposed source of the sequences (and
hence also of types) be $q \in \mathcal{P}(X)$.

Let $\pi(\nu^n)$ denote the probability that $q$ will generate
type $\nu^n$, i.e. $\pi(\nu^n) = \frac{n!}{n_1! \, n_2! \cdots n_m!} \prod_{i=1}^{m} q_i^{n_i}$.
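The probability that the source generates a given type is a multinomial probability, so it can be evaluated directly. The following is a minimal sketch, not part of the original paper; the function names `all_types` and `type_probability`, the uniform three-point source and the choice n = 4 are assumptions made only for illustration.

```python
from math import factorial, prod


def all_types(n, m):
    """Yield every count vector (n_1, ..., n_m) of non-negative integers summing to n."""
    if m == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in all_types(n - first, m - 1):
            yield (first,) + rest


def type_probability(counts, q):
    """Probability that source q generates the type counts/n (multinomial formula)."""
    n = sum(counts)
    coefficient = factorial(n)
    for n_i in counts:
        coefficient //= factorial(n_i)  # exact integer division at every step
    return coefficient * prod(q_i ** n_i for q_i, n_i in zip(q, counts))


# Hypothetical source: the uniform pmf on a three-element support.
q = (1 / 3, 1 / 3, 1 / 3)
n = 4
prior = {counts: type_probability(counts, q) for counts in all_types(n, len(q))}
total = sum(prior.values())  # probabilities of all types sum to 1 (up to rounding)
```

Types are represented here by their count vectors rather than by the normalized frequencies, which keeps the dictionary keys exact.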

BAYESIAN INTERPRETATION OF MAXIMUM PROBABILITY METHOD

The Bayesian recipe prescribes updating a prior distribution
(information) by evidence via Bayes' Theorem
(BT) to get a posterior distribution. Usually
Bayesians use BT to update the prior distribution of a
parameter by evidence in the form of a random
sample, and obtain the posterior distribution of the
parameter given the sample. It is then customary to
select the value of the parameter at which the posterior
distribution attains its maximum (i.e. the mode) and
to perform further inference.

The Bayesian recipe and [2] will be followed here
on a different level. A prior distribution of types will
be updated via BT by data of a special form. Then
the maximum a posteriori (MAP) type will be sought.

The Bayesian updating will be carried out in four
steps:

Step 1: Select a probability mass function $q$ which
could be the best guess of the source of types $\nu^n$. It
specifies a prior probability $P(\nu^n)$ of a type by
the following simple scheme: $P(\nu^n) = \pi(\nu^n)$. Thus,
$\pi(\nu^n)$ is the a priori distribution of types, which is
going to be updated once evidence (data)
becomes available.

Step 2: The data arrive in a rather special form: they
specify a set $\mathcal{H}_n$ of types $\nu^n$ (which were observed,
or are 'feasible' in some general way). In other words,
the evidence is that types which do not belong to
$\mathcal{H}_n$ cannot be observed, or are 'not feasible'.

Step 3: Use Bayes' Theorem to update the prior
probability of type $\pi(\text{type} = \nu^n)$ by the evidence
"type $\in \mathcal{H}_n$" to obtain the posterior probability
$P(\text{type} = \nu^n \mid \text{type} \in \mathcal{H}_n)$ that the type is equal to $\nu^n$
given that it conforms with the evidence (i.e. belongs
to $\mathcal{H}_n$):

$$P(\text{type} = \nu^n \mid \text{type} \in \mathcal{H}_n) = \frac{P(\text{type} \in \mathcal{H}_n \mid \text{type} = \nu^n) \, \pi(\text{type} = \nu^n)}{P(\text{type} \in \mathcal{H}_n)}$$

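Note that $P(\text{type} \in \mathcal{H}_n \mid \text{type} = \nu^n)$ is 0 if $\nu^n \notin \mathcal{H}_n$
and 1 otherwise. Thus, for $\nu^n \in \mathcal{H}_n$ the a posteriori
probability is

$$P(\text{type} = \nu^n \mid \text{type} \in \mathcal{H}_n) = \frac{\pi(\text{type} = \nu^n)}{P(\text{type} \in \mathcal{H}_n)}$$

Obviously, $P(\text{type} \in \mathcal{H}_n) = \sum_{\nu^n \in \mathcal{H}_n} \pi(\nu^n)$.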

Step 4: The type(s) with the highest value of the
posterior probability (MAP type) is sought.
Since types which do not belong to $\mathcal{H}_n$ have zero
posterior probability, the search for the MAP type can
be restricted to types which belong to $\mathcal{H}_n$. So, the
MAP type $\hat{\nu}^n$ is

$$\hat{\nu}^n = \arg\max_{\nu^n \in \mathcal{H}_n} P(\text{type} = \nu^n \mid \text{type} \in \mathcal{H}_n)$$

Since, for fixed $n$ and any $\nu^n$, $P(\text{type} \in \mathcal{H}_n)$ is a
constant, the MAP type turns out to be

$$\hat{\nu}^n = \arg\max_{\nu^n \in \mathcal{H}_n} \pi(\text{type} = \nu^n) \qquad (1)$$

Thus the MAP type $\hat{\nu}^n$ is just the type in $\mathcal{H}_n$
which has the highest value of the prior probability.
Here the updating stops.

Observe that (1) is identical with the prescription of
the Maximum Probability (MaxProb) method (cf.
[3]). Thus the above reasoning provides its Bayesian
interpretation.
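The four updating steps, and the observation that the MAP type coincides with the MaxProb type, can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the support {1, 2, 3}, the uniform source, the sample size n = 20, the evidence "the sample mean equals 2.5" and all function names are hypothetical choices, not taken from the paper.

```python
from math import factorial, prod


def all_types(n, m):
    """Yield every count vector (n_1, ..., n_m) of non-negative integers summing to n."""
    if m == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in all_types(n - first, m - 1):
            yield (first,) + rest


def type_probability(counts, q):
    """Probability that source q generates the type counts/n (multinomial formula)."""
    coefficient = factorial(sum(counts))
    for n_i in counts:
        coefficient //= factorial(n_i)
    return coefficient * prod(q_i ** n_i for q_i, n_i in zip(q, counts))


def mean_is_2_5(counts):
    """Hypothetical evidence: the mean over support (1, 2, 3) equals 2.5.

    Written with integers to avoid floating-point comparisons.
    """
    n1, n2, n3 = counts
    return 2 * (n1 + 2 * n2 + 3 * n3) == 5 * (n1 + n2 + n3)


n = 20
q = (1 / 3, 1 / 3, 1 / 3)

# Step 1: the source q induces the prior over all types.
prior = {c: type_probability(c, q) for c in all_types(n, len(q))}

# Step 2: the evidence is the set of feasible types.
feasible = [c for c in prior if mean_is_2_5(c)]

# Step 3: Bayes' Theorem; infeasible types get zero posterior probability.
evidence = sum(prior[c] for c in feasible)
posterior = {c: prior[c] / evidence for c in feasible}

# Step 4: the MAP type, compared with the MaxProb type that ignores normalization.
map_type = max(posterior, key=posterior.get)
maxprob_type = max(feasible, key=prior.get)
```

In this toy setting six types are feasible, and both searches return the counts (2, 6, 12), i.e. the type (0.1, 0.3, 0.6). Dividing by the evidence does not change which feasible type is the most probable.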



HOW DOES IT RELATE TO REM/MAXENT?

Via the Maximum Probability Theorem (MPT, see [3],
Thm 1 and [6]).

Before stating MPT, the I-projection has to be
defined. The I-projection $\hat{p}$ of $q$ on a set $\Pi \subseteq \mathcal{P}(X)$
is such $\hat{p} \in \Pi$ that $I(\hat{p} \| q) = \inf_{p \in \Pi} I(p \| q)$,
where (see footnote 1) $I(p \| q) = \sum_{X} p_i \log \frac{p_i}{q_i}$ is the I-divergence.
The I-divergence is known under various other names:
Kullback-Leibler distance, KL number, Kullback's
directed divergence, etc. When taken with a minus
sign it is known as (Shannon-Kullback's) relative
entropy.
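For concreteness, the I-divergence between two pmf's on a finite support can be computed as in the following minimal sketch. The function name `i_divergence` and the example pmf's are assumptions for illustration; the natural logarithm and the conventions for zero probabilities follow footnote 1.

```python
from math import inf, log


def i_divergence(p, q):
    """I(p||q) = sum_i p_i log(p_i / q_i), using the natural logarithm."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue      # convention: a term with p_i = 0 contributes 0
        if q_i == 0:
            return inf    # convention: p_i > 0 where q_i = 0 gives +infinity
        total += p_i * log(p_i / q_i)
    return total


q = (1 / 3, 1 / 3, 1 / 3)   # hypothetical source
p = (0.1, 0.3, 0.6)         # hypothetical candidate pmf
d = i_divergence(p, q)      # non-negative; zero only when p equals q
```

The relative entropy of `p` with respect to `q` is then simply `-d`, so minimizing the I-divergence over a feasible set is the same as maximizing relative entropy over it.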

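(MPT; see footnote 2) Let a differentiable constraint $F(\nu^n) = 0$
define the feasible set of types $\mathcal{H}_n$, and let $\mathcal{H} = \{p : F(p) = 0\}$
be the corresponding feasible set of probability
mass functions. Let $\hat{\nu}^n = \arg\max_{\nu^n \in \mathcal{H}_n} \pi(\nu^n)$. Let
$\hat{p}$ be the I-projection of $q$ on $\mathcal{H}$. And let $n \to \infty$. Then
$\hat{\nu}^n$ converges to $\hat{p}$.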

MPT shows that REM is an asymptotic instance
of the MaxProb method. Thus MPT carries the Bayesian
interpretation of MaxProb over into REM/MaxEnt.
Hence, for sufficiently large $n$, the I-projection is just
the MAP type which results from the Bayesian updating
described in the previous section.

To sum up: whenever $n$ is sufficiently large, and
a prior is assigned to types $\nu^n$ as in Step 1,
and new data take the form described in Step 2, and
the prior is updated by the data via BT as in
Step 3, and the MAP type is sought as in
Step 4, then the MAP type is nothing but
the REM I-projection of $q$ on $\mathcal{H}$.
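A small numerical experiment can make the asymptotic statement tangible. The sketch below is not from the paper and rests on several assumptions: a uniform source on the hypothetical support (1, 2, 3), the linear constraint "mean equals 2.5", and the standard fact that for a linear constraint the I-projection has the exponential form p_i proportional to q_i exp(lambda x_i), with lambda found here by bisection. All function names are illustrative.

```python
from math import exp, lgamma, log


def all_types(n, m):
    """Yield every count vector (n_1, ..., n_m) of non-negative integers summing to n."""
    if m == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in all_types(n - first, m - 1):
            yield (first,) + rest


def log_type_probability(counts, q):
    """Logarithm of the multinomial type probability (assumes every q_i > 0)."""
    value = lgamma(sum(counts) + 1)
    for n_i, q_i in zip(counts, q):
        value += n_i * log(q_i) - lgamma(n_i + 1)
    return value


def mean_is_2_5(counts):
    """Hypothetical constraint: the mean over support (1, 2, 3) equals 2.5."""
    n1, n2, n3 = counts
    return 2 * (n1 + 2 * n2 + 3 * n3) == 5 * (n1 + n2 + n3)


def maxprob_type(n, q):
    """Feasible count vector with the highest prior probability."""
    feasible = (c for c in all_types(n, len(q)) if mean_is_2_5(c))
    return max(feasible, key=lambda c: log_type_probability(c, q))


def i_projection(q, support, target_mean, iterations=200):
    """I-projection of q on {p : mean of p equals target_mean}, by bisection on lambda."""
    def tilted(lam):
        weights = [q_i * exp(lam * x_i) for q_i, x_i in zip(q, support)]
        z = sum(weights)
        return [w / z for w in weights]

    low, high = -50.0, 50.0
    for _ in range(iterations):
        mid = (low + high) / 2
        mean = sum(p_i * x_i for p_i, x_i in zip(tilted(mid), support))
        if mean < target_mean:
            low = mid   # the mean increases with lambda, so move right
        else:
            high = mid
    return tilted((low + high) / 2)


support = (1, 2, 3)
q = (1 / 3, 1 / 3, 1 / 3)
p_hat = i_projection(q, support, 2.5)

# L1 distance between the MaxProb type and the I-projection for growing n.
gaps = {}
for n in (20, 100, 400):
    counts = maxprob_type(n, q)
    gaps[n] = sum(abs(c / n - p_i) for c, p_i in zip(counts, p_hat))
```

In this toy case the gap in `gaps` is much smaller at n = 400 than at n = 20, in line with what MPT asserts for the limit. The experiment illustrates the theorem; it does not prove it.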



DISCUSSION 

Why MAP? Why not, say, the median a posteriori type?
As $n \to \infty$, the MAP type becomes just the
I-projection. If the I-projection is unique, then the
Conditioned Weak Law of Large Numbers (CWLLN,
cf. [13], [12], [7], [1], [11], [9], [10]) can be invoked.
Read in the above Bayesian manner, it says that
any type/distribution other than the I-projection has
asymptotically zero posterior probability. This is
why MAP and not the median. However, what if there
are multiple I-projections? Obviously, the Bayesian
interpretation of MaxProb is valid regardless of the
number of MAP types. MPT in its general form (cf.
[6]) also covers the case of multiple MaxProb types
and claims that they converge to I-projections. Then,
to answer the "Why MAP" question in the general
case, one can either recall the Entropy Concentration
Theorem (cf. [6]) or invoke an extension of CWLLN which
also covers the case of multiple I-projections (cf. [5]).

Footnote 1. In the definition of the I-divergence, the conventions
$\log 0 = -\infty$, $\log \frac{b}{0} = +\infty$, $0 \cdot (\pm\infty) = 0$
are assumed. Throughout the paper log denotes the natural
logarithm.

Footnote 2. Originally MPT was stated with the case of a unique
I-projection in mind. Its proof, however, readily allows it to be
stated in general form (see [6]). Since the issue of uniqueness is
irrelevant in the section where MPT is stated, MPT is stated there
in its original form.
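The concentration of posterior probability around the I-projection, which is the argument for MAP made in the Discussion, can also be checked numerically in a toy case. The sketch below assumes a uniform source on the hypothetical support (1, 2, 3) and the evidence "mean equals 2.5". For this particular constraint the I-projection is proportional to (1, t, t^2) with t^2 - t - 3 = 0, a closed form derived for this example only. The function names and the tolerance 0.1 are illustrative assumptions.

```python
from math import exp, lgamma, log, sqrt


def all_types(n, m):
    """Yield every count vector (n_1, ..., n_m) of non-negative integers summing to n."""
    if m == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in all_types(n - first, m - 1):
            yield (first,) + rest


def log_type_probability(counts, q):
    """Logarithm of the multinomial type probability (assumes every q_i > 0)."""
    value = lgamma(sum(counts) + 1)
    for n_i, q_i in zip(counts, q):
        value += n_i * log(q_i) - lgamma(n_i + 1)
    return value


def mean_is_2_5(counts):
    """Hypothetical evidence: the mean over support (1, 2, 3) equals 2.5."""
    n1, n2, n3 = counts
    return 2 * (n1 + 2 * n2 + 3 * n3) == 5 * (n1 + n2 + n3)


q = (1 / 3, 1 / 3, 1 / 3)

# Closed-form I-projection for this toy constraint only (see the lead-in).
t = (1 + sqrt(13)) / 2
z = 1 + t + t * t
p_hat = (1 / z, t / z, t * t / z)


def posterior_mass_near(n, eps):
    """Posterior probability of feasible types within L1 distance eps of p_hat."""
    log_w = {c: log_type_probability(c, q)
             for c in all_types(n, len(q)) if mean_is_2_5(c)}
    top = max(log_w.values())
    weights = {c: exp(v - top) for c, v in log_w.items()}  # rescaled for stability
    near = sum(w for c, w in weights.items()
               if sum(abs(c_i / n - p_i) for c_i, p_i in zip(c, p_hat)) < eps)
    return near / sum(weights.values())


masses = {n: posterior_mass_near(n, 0.1) for n in (20, 100, 400)}
```

In this example the posterior mass within the chosen neighborhood of the I-projection is roughly 0.41 at n = 20 and grows toward 1 as n increases, which is the behavior CWLLN describes for a unique I-projection.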



CONCLUDING NOTE 

Originally (cf. [3]), MaxProb was presented as
a method which looks in $\mathcal{H}_n$ for a type $\hat{\nu}^n =
\arg\max_{\nu^n \in \mathcal{H}_n} \pi(\nu^n)$ which the 'prior' generator $q$
can generate with the highest probability. The word
'prior' was used merely to mean that the generator is
selected before the data arrive. Alternatively, since
unconstrained maximization of the conditional probability
$P(\text{type} = \nu^n \mid \text{type} \in \mathcal{H}_n)$ reduces to maximization
of $\pi(\nu^n)$ constrained to $\nu^n \in \mathcal{H}_n$, MaxProb
could be interpreted as a search for the type with
the highest value of the conditional probability. The
third, Bayesian interpretation of MaxProb, inspired
by [2], was given here. Obviously, MPT stands
regardless of which interpretation of MaxProb
is preferred.
Corrections wrt Version 1: i) Obviously, $P(\text{type} \in \mathcal{H}_n)$
is not given as a ratio of the number of types in $\mathcal{H}_n$
to the number of all types in $\Pi_n$; rather it is
$P(\text{type} \in \mathcal{H}_n) = \sum_{\nu^n \in \mathcal{H}_n} \pi(\nu^n)$. The rest of the argument remains
untouched by the gross lapse. ii) The second question
from the Discussion (Sect. 5) of Version 1 is not
included here.

In the form of Version 1, the paper appeared as M. 
Grendar, Jr. and M. Grendar, Maximum Probability and 
Maximum Entropy methods: Bayesian interpretation, in 
Bayesian Inference and Maximum Entropy methods in 
Science and Engineering, G. Erickson and Y. Zhai (eds.), 
AIP, Melville, pp. 490-495, 2004.